Genomics, Proteomics & Bioinformatics
Oxford University Press (OUP)
All preprints, ranked by how well they match Genomics, Proteomics & Bioinformatics's content profile, based on 171 papers previously published here. The average preprint has a 0.29% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Kim, Y.; Naghavi, M.; Zhao, J. Y.
The human genome contains more than 4000 genes that are longer than 100 kb. These long genes require more time and resources to make a transcript than shorter genes do. Long genes have also been linked to various human diseases. Long genes use specific mechanisms to facilitate their transcription and co-transcriptional processes, which results in unique features in their multi-omics profiles. Although these unique profiles are important for understanding long genes, no database provides an integrated view of, and easy access to, the multi-omics profiles of long genes. We leveraged publicly accessible multi-omics data and systematically analyzed the genomic conservation, histone modifications, chromatin organization, tissue-specific transcriptome, and single-cell transcriptome of 992 protein-coding genes that are longer than 200 kb in the mouse genome. We also examined the evolutionary history of their gene lengths in 15 species that belong to six Classes and 11 Orders. To share the multi-omics profiles of long genes, we developed a user-friendly database, LongGeneDB (https://longgenedb.org), for users to search, browse, and download these profiles. LongGeneDB will be a useful data hub for the biomedical research community to understand long genes.
Zou, L.; Jian, Z.; Li, H.; Xu, C.; Wang, Y.; Guo, X.; Song, X.
Although extensive studies highlight the critical roles of alternative splicing in generating mature circRNA isoforms and enhancing their functional diversity, a significant gap remains in the availability of dedicated databases for circRNA alternative splicing events. To bridge this gap, we developed circASbase, a comprehensive database that catalogues 452,129 alternative splicing events in 884,047 full-length circRNAs from 581 samples across 13 species, and provides rich annotations to facilitate understanding of the splicing regulation of circRNAs. Our findings reveal substantial differences between circRNAs and linear transcripts in the distribution and occurrence of alternative splicing events, highlighting the unique regulatory landscape of circRNAs. These unique splicing events result in functional differences among circRNAs by affecting IRES sites, m6A sites, ORFs, protein features, miRNA targets, and more. In summary, circASbase not only meets the research community's urgent need for data repositories, but also represents a significant advancement in our understanding of circRNA biology. With its user-friendly interfaces and web-based visualization tools, circASbase is poised to become an indispensable resource for researchers exploring the regulatory mechanisms and functional roles of alternative splicing events in circRNAs. This database will continuously drive new insights and discoveries in the field, setting the stage for further advancements in circRNA research. circASbase is available at http://reprod.njmu.edu.cn/cgi-bin/circASbase/
Li, Z.; Chen, Y.; Zhang, Y.; Fang, J.; Xu, Z.; Zhang, H.; Mao, M.; Zhang, L.; Pian, C.
Noncoding RNAs, in particular miRNAs and lncRNAs, play important roles in transcriptional processes and participate in the regulation of various biological functions. Despite their importance for several biological functions, the existing signaling pathway databases do not include information on miRNAs and lncRNAs. Here, we redesigned a pathway database named NcPath by integrating and visualizing a total of 178,308 human experimentally validated miRNA-target interactions (MTIs), 36,537 experimentally verified lncRNA-target interactions (LTIs), and 4,879 experimentally validated human ceRNA networks across 222 KEGG pathways (including 27 sub-categories). To expand the application potential of the redesigned NcPath database, we identified 553,523 reliable lncRNA-PCG interaction pairs by integrating co-expression relations, ceRNA relations, co-TF-binding interactions, co-histone-modification interactions, cis-regulation relations and lncPro Tool predictions between lncRNAs and protein-coding genes. In addition, to determine the pathways in which miRNA/lncRNA targets are involved, we performed a KEGG enrichment analysis using a hypergeometric test. The NcPath database also provides information on MTIs/LTIs/ceRNA networks, PubMed IDs, gene annotations and the experimental verification method used. In summary, the NcPath database will serve as an important and continually updated platform that provides annotation and visualization of the pathways in which noncoding RNAs (miRNAs and lncRNAs) are involved, and supports multimodal noncoding RNA enrichment analysis. The NcPath database is freely accessible at http://ncpath.pianlab.cn/.
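The abstract above mentions a KEGG enrichment analysis based on a hypergeometric test. As a rough illustration only, the following minimal sketch computes the upper-tail hypergeometric p-value for the overlap between a target gene list and one pathway. The function name, parameter names, and the toy numbers are assumptions made for this example and are not taken from NcPath.

```python
from math import comb


def hypergeom_enrichment_pvalue(population: int, pathway_size: int,
                                sample_size: int, overlap: int) -> float:
    """Return P(X >= overlap) for a hypergeometric draw.

    population   -- number of background genes (hypothetical parameter name)
    pathway_size -- background genes annotated to the pathway
    sample_size  -- genes in the target list (e.g. targets of one miRNA)
    overlap      -- target-list genes that fall in the pathway
    """
    max_overlap = min(pathway_size, sample_size)
    total = comb(population, sample_size)
    # Sum the probability mass of every outcome at least as extreme as `overlap`.
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers, chosen only for illustration.
    p = hypergeom_enrichment_pvalue(population=20000, pathway_size=150,
                                    sample_size=300, overlap=12)
    print(f"enrichment p-value: {p:.3g}")
```

In practice the p-values from many pathways would also need a multiple-testing correction; the abstract does not say which one was used, so none is shown here.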
Chu, H.; Wang, K.; Cheng, H.; Ma, W.; Dong, L.; Gou, Y.; Yang, J.; Cai, H.
Spatial transcriptomics (ST) has emerged as a powerful tool for unravelling tissue structure and function. However, the continuous development of ST has made it challenging to select and effectively use appropriate analysis tools. To address this issue, we have developed the Spatial Transcriptome Analysis Hub (STASH, http://cailab.labshare.cn:7004), a comprehensive, systematic, and user-friendly database of ST analysis tools. STASH collects and categorizes most of the tools currently available and provides insight into their current status and trends. This can help researchers quickly locate the appropriate tool for their needs, or even guide researchers in the development of better tools.
Zhang, D.; Sun, H.-X.; Zhou, Z.; Jiang, X.; Chen, D.; Zhou, S.; Huang, J.; Qu, S.; Gu, Y.; Zhang, X.; Jin, X.; Gao, Y.; Shen, Y.; Chen, F.
Birth defects not only pose a major challenge for infant health but also attract worldwide attention. Chromosome abnormalities directly result in diverse birth defects, which are generally deleterious and even lethal. Therefore, gaining molecular regulatory insights into these diseases is important and necessary for effective prenatal screening. Recently, with the advance of next-generation sequencing (NGS) techniques, a large volume of publications and data associated with these diseases is constantly being produced by laboratories across the world. To meet the increasing demand for birth-related data resources, we developed a birth defect multi-omics database (BDdb), freely accessible at http://t21omics.cngb.org and consisting of multi-omics data, circulating free DNA (cfDNA) data, and disease biomarkers. Omics data sets from 138 GSE samples, 5271 GSM samples and 328 entries, and more than 2000 biomarkers of 22 birth-defect diseases in 5 different species were integrated into BDdb, which provides a user-friendly interface for searching, browsing and downloading selected data. Additionally, we re-analyzed and normalized the raw data so that users can also customize the analysis using data generated from different sources or different high-throughput sequencing (HTS) methods. To our knowledge, BDdb is the first comprehensive database associated with birth-defect-related diseases, which would benefit the diagnosis and prevention of birth defects.
Sun, Y.; Zhu, Z.; Zhou, Q.; Wang, Z.; Hou, Y.; Zhou, X.; Li, G.
Cancer is a major global health threat, and early detection of cancer is crucial for improving patient outcomes. DNA methylation in circulating cell-free DNA (cfDNA) has emerged as a promising biomarker for non-invasive cancer diagnosis. However, the integration and utilization of existing cfDNA methylation data have been limited, hindering comprehensive research efforts, especially in the discovery of cfDNA methylation biomarkers. To address this challenge, we introduce cfMethDB, a comprehensive database dedicated to cfDNA methylation in cancer that encompasses 4828 publicly available datasets. Through standardized analysis, we identified 1,048,770 differentially methylated cytosines (DMCs) as candidate biomarkers across seven cancer types. With cfMethDB, we not only identified known cfDNA methylation biomarkers, but also discovered several genes, such as ZIC4, that could be novel biomarkers. Moreover, cfMethDB offers a suite of user-friendly tools, including biomarker evaluation, pan-cancer search and end motif analysis. We hope that cfMethDB will serve as a valuable platform for the discovery of novel cancer cfDNA methylation biomarkers and will facilitate cancer research and clinical applications. cfMethDB is publicly available at: https://cfmethdb.hzau.edu.cn/home.
Matsumoto, H.; Hong, J.
Protozoan parasites cause major infectious diseases and pose persistent challenges to global health, particularly the emergence of drug-resistant strains. Tandem repeats (TRs) and other repetitive architectures are widespread in proteomes, especially in protozoan proteins, where they have been implicated in host-parasite interactions, immune evasion, and antigenicity. However, repeat-containing proteins (RPs) exhibit highly diverse architectures that often extend beyond the simple reiteration of a single motif, making comprehensive and quantitative characterization challenging. In this study, we performed bioinformatics analysis of repeat architectures in protozoan proteins. In addition to the established repeat-detection approaches, we developed a new algorithm, Drepper, which quantifies repeat-architecture complexity. By integrating diverse repeat-related features, we clustered RPs across species and identified distinct groups associated with parasite lineages. Notably, we detected a Plasmodium-specific RP cluster and a Trypanosoma/Leishmania-specific RP cluster; both were characterized by large repeat regions but exhibited contrasting repeat-structure complexity. The Plasmodium-specific RPs showed high complexity, whereas the Trypanosoma/Leishmania-specific RPs displayed significantly low complexity. Functional enrichment analyses indicated that these lineage-associated clusters were enriched in parasite-specific factors. Furthermore, evolutionary analyses suggested that low-complexity repeat architectures may be actively maintained through concerted evolution. Taken together, our results reveal lineage-specific strategies in protozoan repeat architectures and provide a quantitative framework for studying their biological and evolutionary roles.
Xiaoling, Z.; Feng, L.; Guiyuan, T.; Li, Y.; Jiaxin, Z.; Wanqi, M.; Yu, Z.; Congxue, H.; Li, X.; Yinqi, X.; Chunlong, Z.
The brain is the most complex organ of living organisms, and microglia, among its most studied cells, play an indispensable role in the brain's immune microenvironment. Microglia have critical roles not only in neural development and homeostasis, but also in neurodegenerative diseases and malignancies of the central nervous system. However, little is known about the dynamic characteristics of microglia during development or under disease conditions. Recently, single-cell RNA sequencing technologies have made it possible to characterize the heterogeneity of the immune system in the brain. However, integrating and utilizing the massive published datasets to dissect the spatiotemporal characteristics of microglia poses computational challenges. Here, we present microgliaST (bio-bigdata.hrbmu.edu.cn/MST), a database consisting of single-cell microglia transcriptomes across multiple brain regions and developmental periods. Based on high-quality microglia markers collected from published papers, we annotated and constructed human and mouse transcriptomic profiles of 273,374 microglia, comprising 12 regions, 12 periods and 3 conditions (normal, disease, treatment). In addition, microgliaST provides multiple analytical tools to elucidate the landscape of microglia under disorder conditions and to conduct personalized difference analysis and spatiotemporal dynamic analysis. More importantly, microgliaST opens a new avenue for the study of the brain environment, and also provides insights into clinical therapy assessments.
Yang, H.; Park, B.; Park, J.; Lee, J.; Jang, H. S.; Lee, N.; Yoo, H.
Biomedical databases grow by more than a thousand new publications every day. The large volume of biomedical literature that is being published at an unprecedented rate hinders the discovery of relevant knowledge from keywords of interest to gather new insights and form hypotheses. A text-mining tool, PubTator, helps to automatically annotate bioentities, such as species, chemicals, genes, and diseases, from PubMed abstracts and full-text articles. However, the manual re-organization and analysis of bioentities is a non-trivial and highly time-consuming task. ChexMix was designed to extract the unique identifiers of bioentities from query results. Herein, ChexMix was used to construct a taxonomic tree with allied species among Korean native plants and to extract the medical subject headings unique identifier of the bioentities, which co-occurred with the keywords in the same literature. ChexMix discovered the allied species related to a keyword of interest and experimentally proved its usefulness for multi-species analysis.
Qin, E.; Pan, X.; Shen, H.-B.
Many diseases are closely associated with over- or under-expressed genes. To cover more up-to-date associations between over- or under-expressed genes and various diseases, we developed an updated database, OUGene 2.0, for disease-associated over- and under-expressed genes by automatic full-text mining. In total, the new OUGene 2.0 includes 197,236 associations between 12,672 diseases and 11,542 over- or under-expressed genes, about a 5-fold increase compared to the previous version of OUGene. A novel method for rescaling the raw score based on supporting evidence is designed to prioritize the mined associations. OUGene 2.0 provides a holistic view of disease-gene associations and supports user-friendly data exploration at www.csbio.sjtu.edu.cn/bioinf/OUGene for academic use.
Shujia, H.; Wang, C.; Huang, M.; Lu, J.; He, J.-R.; Lin, S.; Liu, S.; Xia, H.; Qiu, X.
High-quality genome databases derived from large-scale family-based birth cohorts are essential resources for investigating genetic determinants affecting early-life traits and the impact of early-life environments on the health of both parents and offspring. Here, we have established the genome database of the Born in Guangzhou Cohort Study (BIGCS), named GDBIG, which represents the first birth-cohort-based genomic database in China and is designed to facilitate generational genetic research. It is based on the Phase I results of BIGCS, which contain low-coverage (~6.63x) whole-genome sequencing (WGS) data and numerous pregnancy phenotypes of 4,053 Chinese participants. The participants were from 30 out of 34 administrative divisions of China, covering Han and 12 minority ethnic groups. Currently, GDBIG provides a comprehensive range of services, including allele frequency inquiries for 56.23 million variants across two generations, a genotype imputation server featuring a high-quality family-based reference panel, and a GWAS meta-analysis interface for various maternal and infant phenotypes. The GDBIG database addresses the dearth of Asian birth-cohort-based genomic resources and provides a valuable platform for conducting genetic analysis online or through application programming interfaces at http://gdbig.bigcs.com.cn/.
Xia, X.-Q.; Guo, C.; Ye, W.; You, D.; Zhang, W.; Cheng, Y.; Shi, M.
With the advancement of single-cell sequencing technology in recent years, an increasing number of researchers have turned their attention to the study of cell heterogeneity. In this study, we created a fish single-cell transcriptome database centered on zebrafish (Danio rerio). FishSCT currently contains single-cell transcriptomic data on zebrafish and 8 other fish species. We used a unified pipeline to analyze 129 datasets from 44 projects from SRA and GEO, resulting in 964/26,965 marker/potential marker information for 245 cell types, as well as expression profiles at single-cell resolution. There are 117 zebrafish datasets in total, covering 25 different types of tissues/organs at 36 different time points during the growth and development stages. This is currently the largest and most comprehensive online resource for zebrafish single-cell transcriptome data, as well as the only database dedicated to the collection of marker gene information of specific cell type and expression profiles at single-cell resolution for a variety of fish. A user-friendly web interface for information browsing, cell type identification, and expression profile visualization has been developed to meet the basic demand in related studies on fish transcriptome at the single-cell resolution.
Suzuki, T.; Ninomiya, K.; Funayama, T.; Okamura, Y.; Tadaka, S.; the Tohoku Medical Megabank Project Study Group; Kinoshita, K.; Yamamoto, M.; Kure, S.; Kikuchi, A.; Tamiya, G.; Takayama, J.
Next-generation sequencing (NGS) has become widely available and is routinely used in basic research and clinical practice. The reference genome sequence is an essential resource for NGS analysis, and several population-specific reference genomes have recently been constructed to provide a choice to deal with the vast genetic diversity of human samples. However, resources supporting population-specific references are insufficient, and it is burdensome to perform analysis using these reference genomes. Here, we constructed a set of resources to support NGS analysis using the Japanese reference genome sequence, JG. We created resources for variant calling, gene and repeat element annotations, variant-effect prediction, read mappability, and RNA-seq analysis. We also provide a resource for reference coordinate conversion for further annotation enrichment. We then provide a variant calling protocol using JG-based resources. Our resources provide a guide to prepare sufficient resources for the use of population-specific reference genomes and can facilitate the migration of reference genomes.
Liu, X.; Xu, K.; Tao, X.; Bo, X.; Chang, C.
Functional enrichment analysis has been widely used to help researchers obtain biological insights from -omics data. However, the results are often redundant and difficult to digest. The key is developing tools that help users explore the relationships between the enriched terms, remove the redundant terms, and finally select representative terms. However, existing tools rarely integrate enrichment analysis with representative term selection in a biologist-friendly manner. Here, we developed a biologist-oriented web server named EnrichMiner to provide a one-stop solution. It is a complete analysis pipeline from a gene list or a ranked gene table to publication-style figures. More importantly, it provides user-friendly interfaces and rich interactive operations to help users explore the term relationships and remove redundancy. EnrichMiner has been integrated into the ExpressVis platform and is freely accessible, without login, at https://omicsmining.ncpsb.org.cn/ExpressVis/EnrichMiner.
Gao, L.-z.; Zhang, F.; Feng, L.-y.; Lin, P.-f.; Jia, J.-j.
Camellia crapnelliana Tutch., belonging to the Theaceae family, is an excellent landscape tree species with high ornamental value. It is also an important woody oil-bearing plant with high ecological, economic, and medicinal value. Here, we report the first chromosome-scale reference genome of C. crapnelliana, generated by integrating the SMRT, Hi-C and Illumina sequencing platforms. The genome assembly had a total length of ~2.94 Gb with a contig N50 of ~67.5 Mb, and ~96.34% of contigs were assigned to 15 chromosomes. In total, we predicted 37,390 protein-coding genes, ~99.00% of which were functionally annotated. Comparative genomic analysis showed that the C. crapnelliana genome underwent a whole-genome duplication event shared across the Camellia species and a γ-WGT event that was shared by all core eudicot plants. Furthermore, we identified the major genes involved in the biosynthesis of oleic acids and terpenoids in C. crapnelliana. The chromosome-scale genome of C. crapnelliana will become a valuable resource for understanding the genetic basis of fatty acid biosynthesis, and will greatly facilitate the exploration and conservation of C. crapnelliana.
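The assembly above is summarized by its contig N50. For readers unfamiliar with the statistic, here is a minimal sketch of how N50 is conventionally computed: sort the contig lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name and the toy lengths are assumptions for illustration and have nothing to do with the actual C. crapnelliana contigs.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    # Only reached when the input is empty.
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    # Toy contig lengths (arbitrary units), chosen only for illustration.
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
    print("N50 =", n50(toy_contigs))
```

A larger N50 means that more of the assembly sits in long contigs, which is why it is commonly quoted alongside the total assembly length.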
Tu, Y.-T.; Chen, C.-A.; Gendron, J.; Lee, C.-M.
Protein ubiquitination, mediated by E3 ubiquitin ligases, is a critical regulatory mechanism of eukaryotic cellular processes, including circadian clock function. However, identifying E3-substrate pairs remains technically challenging due to substrate instability and the genetic redundancy of E3s. To overcome these limitations, we developed a high-throughput yeast two-hybrid E3 decoy screening platform, enabling systematic mapping of E3-substrate interactions. Using a library of 283 Arabidopsis F-box and U-box E3 decoys, we screened 21 core circadian clock regulators and identified 77 potential E3-substrate interaction pairs involving 56 E3s and 16 clock proteins. Focusing on high-confidence hits, we demonstrated that PUB18 physically interacts with the central clock regulators LHY and JMJD5 and promotes their ubiquitination in planta. Genetic analyses further revealed that PUB18 and its homolog PUB19 function redundantly in circadian clock regulation. This study establishes the E3 decoy yeast two-hybrid platform as a versatile and scalable tool for dissecting ubiquitination networks in broad biological processes.
Chen, N.; Fu, W.; Zhao, J.; Shen, J.; Chen, Q.; Zheng, Z.; Chen, H.; Sonstegard, T. S.; Lei, C.; Jiang, Y.
Next-generation sequencing has yielded a vast amount of cattle genomic data for the global characterization of population genetic diversity and the identification of regions of the genome under natural and artificial selection. However, efficient storage, querying and visualization of such large datasets remain challenging. Here, we developed a comprehensive Bovine Genome Variation Database (BGVD, http://animal.nwsuaf.edu.cn/BosVar) that provides six main functionalities: Gene Search, Variation Search, Genomic Signature Search, Genome Browser, Alignment Search Tools and the Genome Coordinate Conversion Tool. The BGVD contains information on genomic variations comprising ~60.44 M SNPs, ~6.86 M indels, 76,634 CNV regions and signatures of selective sweeps in 432 samples from modern cattle worldwide. Users can quickly retrieve distribution patterns of these variations for 54 cattle breeds through an interactive source of breed origin map using a given gene symbol or genomic region for any of the three versions of the bovine reference genomes (ARS-UCD1.2, UMD3.1.1, and Btau 5.0.1). Signals of selection are displayed as Manhattan plots and Genome Browser tracks. To further investigate and visualize the relationships between variants and signatures of selection, the Genome Browser integrates all variations, selection data and resources from NCBI, the UCSC Genome Browser and AnimalQTLdb. Collectively, all these features make the BGVD a useful archive for in-depth data mining and analyses of cattle biology and cattle breeding on a global scale.
Sun, Y.; Chen, S.; Peng, Y.; Zhang, X.; Jiang, T.; Fang, B.; Zhang, P.; Li, Y.; Ren, Y.
Cell-cell communication analysis is frequently used in single-cell RNA and spatial transcriptomics, and many tools such as CellPhoneDB, CellChat and stLearn have been developed for it. Ligand-receptor interactions are the core of cell-cell communication analysis. Because research on ligand-receptor and even protein-protein interactions has focused on humans and mice, curated human and mouse ligand-receptor databases have been established, and cell-cell interactions in single-cell RNA sequencing data from these two species can be analyzed directly. However, for rats, chickens, pigs, monkeys, and other species, cell-cell interaction analysis is often implemented through orthologous gene mapping, owing to the lack of curated ligand-receptor interaction databases for these species. We collected cell-cell interaction data mainly from KEGG for rats, chickens, pigs, and monkeys, and extended the data with Reactome and IntAct. Then, using CellChatV2 with our collected rat ligand-receptor interactions and with CellChatV2's own mouse data, respectively, we analyzed public 10x Genomics rat scRNA data and found that 70 ligand-receptor interactions that were significant in the mouse-based analysis were also significant in the rat-based analysis. We also obtained chicken, pig, and monkey scRNA data from published literature and analyzed cell-cell interactions using our collected ligand-receptor interactions for these species, which showed that our data are reliable and useful. Lastly, we have transformed the ligand-receptor interactions for rat, chicken, pig, and monkey into the CellChatDB format, which enables swift and straightforward analysis of cell-cell communication in single-cell and spatial data of these four species. All the ligand-receptor interaction datasets for rats, chickens, pigs, and monkeys, as well as the program code, are available at https://github.com/qingchen36/ligand-receptor. Using our program, one can rapidly obtain ligand-receptor interaction data for other species.
Zhou, H.-Y.; Cheng, Y.-X.; Xu, L.; Li, J.-Y.; Tao, C.-Y.; Ji, C.-Y.; Han, N.; Yang, R.; Li, Y.; Wu, A.
Recently, patients co-infected by two SARS-CoV-2 lineages have been sporadically reported. Concerns have been raised because previous studies have demonstrated that co-infection may contribute to the recombination of RNA viruses and cause severe clinical symptoms. In this study, we estimated the lineage composition, tendency, and frequency of co-infection events in the population from a large-scale genomic analysis of SARS-CoV-2 patients. The SARS-CoV-2 lineage(s) present in each sample were recognized by assigning within-host site variations to lineage-defining feature variations using a hypergeometric distribution method. Of all the 29,993 samples, 53 (~0.18%) co-infection events were identified. Apart from 52 co-infections with two SARS-CoV-2 lineages, one sample co-infected with three SARS-CoV-2 lineages was identified for the first time. As expected, the co-infection events mainly happened in regions where more than two dominant SARS-CoV-2 lineages co-existed. However, co-infections with two sub-lineages within the Delta lineage were detected as well. Our results provide a useful reference framework for the high-throughput detection of SARS-CoV-2 co-infection events in Next Generation Sequencing (NGS) data. Although low in average rate, the co-infection events showed an increasing tendency with the increased diversity of SARS-CoV-2. Considering the large base of SARS-CoV-2 infections globally, co-infected patients would be a non-negligible population. Thus, more clinical research is urgently needed on these patients.
Liu, Y.; Song, F.; Li, Z.; Chen, L.; Xu, Y.; Sun, H.; Chang, Y.
During the course of cancer treatment, both the efficacy and the adverse effects of drugs on patients should be taken into account. Although some public databases and modeling frameworks have been developed through studies on drug response, the negative effects of drugs are always neglected. Furthermore, most of them consider only the effects of a drug on cell lines, whereas estimating its effects on a patient still requires a huge amount of work to integrate data from various databases and calculations, especially in relation to precision treatment. To address these issues, we developed the DBPOM (http://www.dbpom.net/, a comprehensive database of pharmaco-omics for cancer precision medicine), which explores the efficacy levels of various drugs by calculating their potency in reversing or enhancing cancer-associated gene expression changes. Compared with existing databases, the DBPOM can estimate the effectiveness of a drug on individual patients by mapping various cell lines to each person according to their genetic mutation similarities. The DBPOM is an easy-to-use, one-stop database for clinicians and drug researchers to search and analyze the overall effect of a drug or a drug combination on cancer patients, as well as the biological functions that they target. We anticipate that DBPOM will become an important resource and analysis platform for drug development, drug mechanism studies and the discovery of new therapies.